11  Importing Data into R (2.3)

11.1 Learning Outcomes

By the end of this tutorial, you should:

  • understand the most common formats in which data is stored

  • be able to import datasets in various formats into R, ready for preparation and analysis

11.2 Data Formats

When working in R, we often begin by importing data that is stored in a format other than R’s native storage format. We might also need to export our dataframe or tibble in a format that can be sent to another analyst, or for backup purposes.

It is important therefore to understand the most common formats in which data is stored, and be aware of their strengths and weaknesses.

11.2.1 CSV (Comma-Separated Values)

CSV is a simple, plain-text format where data is organized in rows, with each column separated by a comma. CSV files are very widely used due to their simplicity and ease of use. They can be read and written by many programs, including Microsoft Excel, Numbers, and Google Sheets.

CSV files are easy to read and write: it’s usually quite straightforward to open a .csv file and understand what it contains. They’re widely supported by software programs and languages. Importantly, in an era of ‘big data’, they are generally smaller in file size, and therefore more portable, than (for example) Excel files.

Disadvantages include limited support for complex data structures, no built-in data type support, and no support for multiple sheets like we find in Excel.

11.2.2 Excel (XLS and XLSX)

Microsoft Excel files (XLS and XLSX) are a popular choice for data storage and analysis, especially for smaller datasets and non-technical users.

Excel files offer built-in data types, formulas, formatting, and support for multiple sheets within a single file. You’ll frequently encounter Excel being used to create and store data within sports settings.

Excel has a number of strengths, including support for complex data structures and data types, built-in formulas and functions, and widespread usage and familiarity. There are also lots of online tutorials on using Excel, including some quite complex data processing.

Some disadvantages include it being a proprietary format that requires specific software to access. It tends to create larger files than CSV, and the format has limited support in some programming languages.

Note

One of the major disadvantages of working in Excel is a lack of transparency and replicability in data cleaning and preparation. For example, it’s hard to maintain a record of which variables you’ve deleted, or where you’ve changed a missing value.

For that reason, I would recommend avoiding Excel if possible, unless it’s just being used for some basic data viewing.

11.2.3 JSON (JavaScript Object Notation)

JSON is a lightweight, text-based data interchange format that is easy for humans to read and write and easy for machines to parse and generate. JSON is often used to transmit data between a server and a web application or for data storage. JSON is a popular format for working with APIs and web services.

If you’re unfamiliar with JSON, you can see some examples of it here.

Some advantages of JSON are that it is human-readable and easy to understand. It supports complex and nested data structures, and is widely supported by various programming languages.

Some disadvantages are that, like Excel, files in this format tend to be larger than CSV, and JSON has no out-of-the-box support for tabular data structures.

11.2.4 XML (eXtensible Markup Language)

XML is a markup language designed for encoding data in a format that is both human-readable and machine-readable. XML is often used for data exchange between systems and applications. It can represent complex hierarchical data structures and allows for custom tags and attributes.

Some advantages of XML are that it is human-readable, highly flexible and extensible, and it supports complex and nested data structures.

Some disadvantages are that it generates larger file sizes compared to CSV and JSON. It also has a more complex syntax compared to JSON, and is less intuitive for data analysis tasks.

You can compare some data in both JSON and XML format here.

11.2.5 Parquet, Avro, and ORC

You may be unfamiliar with these data types. They are binary file formats optimized for big data processing and are commonly used in Hadoop and Spark ecosystems.

These formats offer a number of advantages over others, including columnar storage, data compression, and schema evolution capabilities, which provide improved performance and storage efficiency for large-scale data processing.

However, they have limited support outside big-data tools and languages and, being binary formats, are not human-readable. It is unlikely you will use them within the MSc programme, though certain placement providers may be using them within big-data applications.

11.3 Packages for Importing Data

As you might expect, there are a number of useful packages in R that are designed specifically for the purpose of importing raw data.

Tip

Just a reminder – packages are additional libraries that you can install and run within R. They provide additional functionality over and above ‘base’ R. Remember, you only need to install a package once for each computer you use. You DO need to load it (using the library() command) each time you wish to use it in a session.

Here are some of the most common packages that are used for data import. Their names give you a clue about which format they can be used for!

  • tidyverse
  • readxl
  • jsonlite
  • RMySQL

11.3.1 Working with CSV Data

```r
library(tidyverse)
csv_data <- read_csv("path/to/your/csv_file.csv")
```
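Section 11.2 mentioned exporting data as well as importing it. As a sketch, assuming the tidyverse is installed, a full round trip looks like this (the tibble contents and file name are invented for illustration):

```r
library(tidyverse)

# A small example tibble (invented data, for illustration only)
athletes <- tibble(
  name  = c("Ana", "Ben", "Cara"),
  sport = c("Rowing", "Cycling", "Swimming"),
  score = c(72.5, 68.0, 81.3)
)

# Export to CSV in a temporary location
csv_path <- file.path(tempdir(), "athletes.csv")
write_csv(athletes, csv_path)

# Import it back; read_csv() guesses column types from the text
athletes_in <- read_csv(csv_path, show_col_types = FALSE)
```

Note that write_csv() comes from the readr package (loaded as part of the tidyverse), and unlike base R’s write.csv() it does not add row names to the file.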

11.3.2 Working with Excel Data

```r
library(readxl)
excel_data <- read_excel("path/to/your/excel_file.xlsx")
```
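Because Excel workbooks can contain multiple sheets, readxl also lets you list and select them. The sketch below uses one of the demonstration workbooks that ships with readxl, so it runs without needing a file of your own:

```r
library(readxl)

# readxl ships with demo workbooks; datasets.xlsx contains several sheets
path <- readxl_example("datasets.xlsx")

# List the sheet names in the workbook
excel_sheets(path)

# Read one named sheet; you can also select by position, e.g. sheet = 1
iris_data <- read_excel(path, sheet = "iris")
```

If you omit the sheet argument, read_excel() reads the first sheet by default.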

11.3.3 Working with JSON Data

```r
library(jsonlite)
json_data <- fromJSON("path/to/your/json_file.json")
```
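fromJSON() can also parse a JSON string directly, which is handy when data arrives from an API rather than a file. A minimal sketch (the JSON itself is invented for illustration):

```r
library(jsonlite)

# A JSON array of objects; jsonlite simplifies it to a data frame
json_text <- '[
  {"player": "Smith", "goals": 4},
  {"player": "Jones", "goals": 7}
]'

players <- fromJSON(json_text)

# Going the other way: convert an R object back into JSON text
toJSON(players, pretty = TRUE)
```

This simplification from a JSON array into a data frame is a default of jsonlite; deeply nested JSON may come back as nested lists instead.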

11.4 Working with Databases

Databases are structured stores of data, often hosted on a remote server and accessed over a network. You may find yourself working with sport data that is stored centrally in this way.

11.4.1 Connecting using dbConnect()

First, you need to establish a connection to the database as follows:

```r
library(RMySQL)
connection <- dbConnect(MySQL(),
                        host = "localhost",
                        user = "username",
                        password = "password",
                        dbname = "database_name")
```

11.4.2 Importing from a database

Once you have connected to the database, the process for importing the data is similar to that used previously in this tutorial.

```r
table_data <- dbReadTable(connection, "table_name")
```
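With large tables you often don’t want to import everything; dbGetQuery() lets the database do the filtering before any data reaches R. The sketch below substitutes an in-memory SQLite database (via the RSQLite package) for a remote MySQL server, so it runs without any credentials; the table and query are invented for illustration:

```r
library(DBI)
library(RSQLite)

# An in-memory SQLite database stands in for a remote server here
connection <- dbConnect(SQLite(), ":memory:")

# Create a small demo table (invented data)
dbWriteTable(connection, "results",
             data.frame(team = c("A", "B", "C"), points = c(10, 25, 18)))

# Let the database filter the rows before they are imported into R
top_teams <- dbGetQuery(connection, "SELECT * FROM results WHERE points > 15")

dbDisconnect(connection)
```

The same dbGetQuery() pattern works with the MySQL connection shown above, since both packages follow the common DBI interface.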

11.4.3 Closing using dbDisconnect()

You can then close your connection to the database, as you now have a local copy of the dataset.

```r
dbDisconnect(connection)
```

There is more on accessing online databases here.

11.5 Practical Activity

Refer to the webpage here. Work through the activities listed, including how to determine your current working directory, which will be important in the future.